Lightweight Fault-tolerance for Highly Cooperative Distributed Applications
Authors
Abstract
The recent introduction of high-speed networks and faster processors, together with the rapid growth of heterogeneous large-scale distributed systems, has enabled the development of distributed applications that move beyond the client-server model to truly harness the computational potential of distributed systems. These new applications will be structured around groups of agents that communicate using messages as well as files. Some of these emerging applications will be critical enough to life or business to warrant explicit process replication to achieve high availability. Often, however, explicit replication will be too costly to implement, or high availability will simply not be necessary. In these circumstances, the availability of low-overhead fault-tolerance techniques will be crucial to achieving reliability. To address these needs, we are developing lightweight fault-tolerance (LFT), a new low-overhead approach to fault-tolerance for highly cooperative distributed applications. In the first part of this paper, we describe how LFT extends to file communication the causal logging techniques used in message passing. We show how, in our approach, all the synchronous operations that log-based protocols currently perform during file I/O are either eliminated or made asynchronous, thereby removing the opportunities for blocking. Furthermore, we argue that our approach has the potential to enhance the effectiveness of existing rollback recovery techniques for software fault-tolerance. In the second part of the paper, we validate LFT through extensive simulation. Our results indicate that LFT brings the cost of file communication down to the level of message passing, drastically reducing the overhead incurred by fault-tolerant applications in performing file I/O.
Similar resources
Lightweight Message Logging Protocol for Distributed Sensor Networks
Among the many rollback-recovery protocols developed to provide fault-tolerance for long-running distributed applications, sender-based message logging with checkpointing is one of the most lightweight fault-tolerance techniques applicable in this field, significantly decreasing the high failure-free overhead of synchronous logging by using the message sender's volatile memory as...
Application Aware for Byzantine Fault Tolerance
Driven by the need for higher reliability of many distributed systems, various replication-based fault tolerance technologies have been widely studied. A prominent technology is Byzantine fault tolerance (BFT). BFT can help achieve high availability and trustworthiness by ensuring replica consistency despite the presence of hardware failures and malicious faults on a small portion of the replic...
Improving the palbimm scheduling algorithm for fault tolerance in cloud computing
Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...
On Applications of Cooperative Security in Distributed Networks
Many applications running on the Internet operate in fully or semi-distributed fashion including P2P networks or social networks. Distributed applications exhibit many advantages over classical client-server models regarding scalability, fault tolerance, and cost. Unfortunately, the distributed system operation also brings many security threats along that challenge their performance and reliabi...
Towards Middleware for Fault-Tolerance in Distributed Real-Time and Embedded Systems
Distributed real-time and embedded (DRE) systems often require support for multiple simultaneous quality of service (QoS) properties, such as real-timeliness and fault tolerance, that operate within resource-constrained environments. These resource constraints motivate the need for a lightweight middleware infrastructure, while the need for simultaneous QoS properties requires the middleware to ...